
add: ModelOpt Launcher for Slurm job submission #1031

Open
ChenhanYu wants to merge 12 commits into main from chenhan/modelopt-launcher

Conversation

ChenhanYu (Collaborator) commented Mar 12, 2026

Add a launcher/ module with launch.py that submits quantization, training, and evaluation jobs to Slurm clusters via nemo-run. It produces the same code/ layout as nmm-sandbox's slurm.py, so the same YAML configs work in both. Includes Megatron-LM and Model-Optimizer as submodules.

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced ModelOpt Launcher for submitting quantization, training, and evaluation jobs to Slurm clusters or running locally via Docker.
    • Added YAML-based job configuration with multi-task pipeline support and global variable interpolation.
    • Included example workflows for Qwen3-8B quantization and EAGLE3 speculative decoding.
    • Provided configurable Slurm and execution environment defaults.
  • Documentation

    • Added comprehensive README with quick start, environment setup, and configuration guidance.
    • Added advanced guide detailing launcher architecture and integration patterns.
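
A pipeline in this YAML-based scheme might look like the following sketch. The keys, task names, and interpolation syntax here are illustrative assumptions based on the feature description above, not the launcher's actual schema (which is defined in launcher/core.py):

```yaml
# Hypothetical pipeline config: two sequential tasks sharing interpolated
# global variables. Field names are assumptions, not the real schema.
globals:
  model: Qwen/Qwen3-8B
  output_dir: /results/${model}

tasks:
  - name: quantize
    script: common/megatron-lm/quantize/quantize.sh
    slurm:
      nodes: 1
      gpus_per_node: 4
  - name: evaluate
    script: common/vllm/query.sh
    environment:
      HF_HOME: /hf-local
```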

Add launcher/ module with launch.py that submits quantization, training,
and evaluation jobs to Slurm clusters via nemo-run. Produces identical
code/ layout as nmm-sandbox's slurm.py so the same YAML configs work
in both. Includes Megatron-LM and Model-Optimizer as submodules.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
ChenhanYu requested a review from a team as a code owner on March 12, 2026 at 19:46
coderabbitai bot commented Mar 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Introduces a comprehensive ModelOpt launcher framework for submitting quantization, training, and evaluation jobs to Slurm clusters or Docker environments. Includes core orchestration logic, configuration models, factory-based task creation, example YAML pipelines for Qwen3-8B, and shell/Python utilities for data synthesis, model quantization, and speculative decoding workflows.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Project Configuration & Initialization | `.gitmodules`, `launcher/__init__.py`, `launcher/pyproject.toml` | Registers git submodules (Megatron-LM, Model-Optimizer), establishes the Apache-2.0 license, module docstring, and project dependencies (nemo-run, pyyaml) requiring Python 3.10+. |
| Core Launcher Framework | `launcher/launch.py`, `launcher/core.py`, `launcher/slurm_config.py` | Implements Slurm/Docker job orchestration with a CLI entrypoint, environment defaults, YAML-driven SandboxTask/SandboxPipeline models, a factory registry for dynamic configuration, dual executor builders (Slurm SSH, Docker GPU), global variable interpolation, and experiment metadata generation. |
| Git Submodule References | `launcher/modules/Megatron-LM`, `launcher/modules/Model-Optimizer` | Records pinned subproject commit hashes for reproducible dependency versions. |
| Documentation | `launcher/README.md`, `launcher/ADVANCED.md` | User guide covering quick start, YAML format, multi-task pipelines, parameter overrides, and environment variables; advanced guide detailing architecture, the factory system, metadata, and Claude Code integration workflows. |
| Example Configurations | `launcher/Qwen/Qwen3-8B/megatron_lm_ptq.yaml`, `launcher/Qwen/Qwen3-8B/hf_offline_eagle3.yaml` | YAML-based job definitions: NVFP4 quantization with 4-GPU tensor parallelism; a multi-task EAGLE3 speculative decoding pipeline (data synthesis, hidden-state dump, offline training, benchmark) with 4 sequential Slurm jobs. |
| Utility Scripts - EAGLE3 Workflow | `launcher/common/eagle3/dump_offline_data.sh`, `launcher/common/eagle3/offline_training.sh` | Shell scripts orchestrating EAGLE3 offline training: environment setup, SLURM task detection, model data collection via trtllm-llmapi-launch, and training with error handling. |
| Utility Scripts - Quantization | `launcher/common/megatron-lm/quantize/quantize.sh` | Multi-step Megatron-LM quantization workflow (quantize, eval, convert, export) with environment setup and centralized error trapping. |
| Utility Scripts - Service & Data Generation | `launcher/common/service_utils.sh`, `launcher/common/query.py` | Bash MPI/Slurm environment utilities with rank-aware error handling and GPU detection; a Python LLM client for data synthesis via an OpenAI-compatible API with dataset loading, sharding, and concurrency. |
| Inference Launchers | `launcher/common/tensorrt-llm/query.sh`, `launcher/common/vllm/query.sh`, `launcher/common/specdec_bench/quick_check.sh` | Server orchestration and query execution: TensorRT-LLM and vLLM with background launch, health polling, and queries via `common/query.py`; a speculative decoding benchmark wrapper script. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'add: ModelOpt Launcher for Slurm job submission' accurately describes the main change: adding a new launcher module for submitting jobs to Slurm clusters, which aligns with the primary objective of the PR. |
| Security Anti-Patterns | ✅ Passed | No security anti-patterns detected: no torch.load(weights_only=False), numpy.load(allow_pickle=True), hardcoded trust_remote_code=True, eval/exec on untrusted input, or #nosec comments. All dependencies use permissive licenses. |


Collaborator

Is there any way to use ModelOpt from the repo root instead of adding an additional submodule?

Collaborator Author

Yes, I iterated through several versions. The alternatives are a relative import or a soft reference; the benefit would be not needing to submodule ModelOpt.

Cons:

  • Relative import: the nemo-run packager cannot handle this well. In the code/ section (nemo-run copies the entire code tree for every experiment), the modelopt source code cannot quite maintain its original structure. In the internal sandbox, ModelOpt lives in modules/Model-Optimizer, so if the layout differs here, we would need to fix many path differences.
  • Soft reference: this should be doable, but users would need to create the soft reference themselves, and I haven't tested it fully.

Collaborator Author

Update: I changed it to a soft reference, so git clone should keep this relative symlink. Even if it is accidentally deleted, launch.py will recreate it.
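
The recreate-if-missing behavior could be implemented along these lines. This is a minimal sketch under stated assumptions: the helper name and directory layout are guesses based on the discussion above, not the actual launch.py code.

```python
import os

def ensure_modelopt_symlink(launcher_dir: str, modelopt_root: str) -> str:
    """Create modules/Model-Optimizer as a relative symlink if it is missing.

    Hypothetical helper; the real logic lives in launch.py.
    """
    link_path = os.path.join(launcher_dir, "modules", "Model-Optimizer")
    if os.path.lexists(link_path):
        return link_path  # already present (symlink or real checkout)
    os.makedirs(os.path.dirname(link_path), exist_ok=True)
    # A relative target survives cloning the repo to a different prefix.
    target = os.path.relpath(modelopt_root, os.path.dirname(link_path))
    os.symlink(target, link_path)
    return link_path
```

git clone preserves such relative symlinks as long as core.symlinks is enabled, which is the default on Linux.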

codecov bot commented Mar 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.09%. Comparing base (a56b6f3) to head (410de11).
⚠️ Report is 23 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1031      +/-   ##
==========================================
- Coverage   71.73%   70.09%   -1.65%     
==========================================
  Files         211      221      +10     
  Lines       23949    25541    +1592     
==========================================
+ Hits        17181    17902     +721     
- Misses       6768     7639     +871     


coderabbitai bot left a comment

Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.gitmodules:
- Around line 4-6: Remove the self-referential submodule entry for
"launcher/modules/Model-Optimizer" in .gitmodules (the block with path =
launcher/modules/Model-Optimizer and url =
https://github.com/NVIDIA/Model-Optimizer.git); instead rely on the top-level
Model-Optimizer working tree for packaging/mounting and remove any code that
updates the nested gitlink for launcher/modules/Model-Optimizer so you no longer
create or expect a nested checkout.
- Around line 1-3: The .gitmodules entry for the submodule at path
"launcher/modules/Megatron-LM" currently points to the personal fork URL
"https://github.com/AAnoosheh/Megatron-LM.git"; update that submodule URL to the
canonical upstream "https://github.com/NVIDIA/Megatron-LM.git" (and then run git
submodule sync && git submodule update --init --recursive) so the
launcher/modules/Megatron-LM submodule references NVIDIA/Megatron-LM instead of
the personal fork.

In `@launcher/launch.py`:
- Around line 87-89: The launcher sets container_mounts to mount the host cache
at /hf-local (variable container_mounts, local hf_local), but the code still
defaults HF_HOME to /hf-cache so the mounted cache is ignored; update places
that set the HF_HOME default (where container_mounts is None and the other
occurrences at the blocks referenced around lines 111-122 and 330-336) to align
HF_HOME with hf_local (i.e., use the hf_local value or "/hf-local" as the
HF_HOME default) so the injected environment points HuggingFace to the actual
mounted path.
- Around line 377-379: The code sets NEMORUN_HOME via
run.config.set_nemorun_home but still writes/reads metadata.json relative to the
process CWD; update all metadata file operations to use the configured NeMo Run
home instead of os.getcwd(): replace os.getcwd()/bare relative paths with
run.config.get_nemorun_home() (or run.config.nemorun_home) when constructing
paths for metadata.json and any related experiment directories (e.g., where
metadata is written around run.config.set_nemorun_home and the similar block at
the other location), ensuring both reads and writes use the same configured
home.
- Around line 407-425: The issue is that DEFAULT_LOCAL_ENV and DEFAULT_SLURM_ENV
are applied after task_env, causing task-level environment keys to be
overwritten; in the block that builds task_env (around task.environment,
hf_local, get_docker_executor and get_slurm_executor) apply the defaults before
merging task values or merge them using defaults as base (e.g., start with a
copy of DEFAULT_LOCAL_ENV/DEFAULT_SLURM_ENV then update with task.environment)
or use setdefault semantics so that explicit task.environment entries (HF_HOME,
HF_TOKEN, MLM_SKIP_INSTALL, LAUNCH_SCRIPT) override the launcher defaults;
adjust the update order where task_env.update(DEFAULT_LOCAL_ENV) /
task_env.update(DEFAULT_SLURM_ENV) currently occur so task values take
precedence.
- Around line 171-173: The code currently looks up a callable via globals()
using function_name and invokes it—replace this with an explicit allowlist
factory map (e.g., a dict mapping allowed factory names to their functions) and
use that map to resolve the factory instead of globals(); in the block that sets
function_name and slurm_config, validate that function_name is present in the
factory map, raise a clear error if not allowed, then call the resolved factory
with slurm_config (preserving the existing use of config_from_yaml["script"],
function_name, and slurm_config variables) to produce the slurm config.
- Around line 250-270: packager is using LAUNCHER_DIR for all include_pattern
entries, causing "services/*" and "tests/*" (which live at the repository root /
MODELOPT_ROOT) to be skipped; update the packager call so the relative_path list
maps the first five patterns to LAUNCHER_DIR and the last two patterns
("services/*", "tests/*") to MODELOPT_ROOT (i.e., make relative_path =
[LAUNCHER_DIR]*5 + [MODELOPT_ROOT]*2) so the include_pattern entries resolve to
the correct directories; keep the include_pattern list unchanged and only adjust
the relative_path mapping in the packager invocation.
- Around line 417-424: Replace direct uses of the internal Experiment._id:
implement a small helper get_experiment_id(exp) that obtains the experiment id
via supported public APIs (try exp.id, then exp.get_id(), then
exp.metadata.get("id") or exp.to_dict().get("id")) and raise a clear error
instructing to update nemo_run if none exist; then call get_experiment_id(exp)
instead of exp._id in get_docker_executor(...) and get_slurm_executor(...) (and
the other occurrences referenced) so the launcher no longer depends on the
private _id attribute.

In `@launcher/modules/Megatron-LM`:
- Line 1: Document why the launcher submodule uses the personal fork
https://github.com/AAnoosheh/Megatron-LM.git (branch
aanuosh…/modelopt-example-import-order) by adding a short rationale in the repo
(e.g., README or launcher packager docs): list the exact diffs/changes required
from the fork that the launcher depends on (identify files/functions/patches),
explain why those changes cannot live in NVIDIA/Megatron-LM as-is, state whether
and when the changes will be upstreamed or removed (upstream PR IDs or
timeline), and describe the maintenance plan for this fork (how it will be
rebased/merged with upstream security fixes and who owns it). Ensure references
to the submodule URL, branch name, and the launcher packager are included so
reviewers can locate the dependency and the specific modifications.
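
The allowlist replacement suggested for launch.py above could be sketched like this. The function and variable names are assumptions taken from the review comment, not the actual code:

```python
from typing import Callable

def slurm_factory(**kwargs) -> dict:
    """Stand-in for the real factory in slurm_config.py (assumed name)."""
    return {"kind": "slurm", **kwargs}

# Explicit allowlist: only names registered here can be resolved from YAML,
# unlike a globals() lookup, which would expose every importable symbol.
FACTORY_MAP: dict[str, Callable] = {
    "slurm_factory": slurm_factory,
}

def resolve_factory(function_name: str) -> Callable:
    """Resolve a YAML-specified factory name, rejecting anything unlisted."""
    try:
        return FACTORY_MAP[function_name]
    except KeyError:
        raise ValueError(
            f"Unknown factory {function_name!r}; allowed: {sorted(FACTORY_MAP)}"
        ) from None
```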

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: af7e3654-7a2f-40c2-9b19-f85cc6e8f625

📥 Commits

Reviewing files that changed from the base of the PR and between 69c0d47 and d28acd3.

📒 Files selected for processing (6)
  • .gitmodules
  • launcher/__init__.py
  • launcher/launch.py
  • launcher/modules/Megatron-LM
  • launcher/modules/Model-Optimizer
  • launcher/pyproject.toml

Extract shared logic (dataclasses, executor builders, run loop, version
reporting) into core.py. Both launch.py and nmm-sandbox's slurm.py
import from core.py to avoid divergence. Add slurm_config.py with
generic env-var-driven factory, service scripts, Qwen3-8B PTQ example,
and README with usage, flags, and bug reporting instructions.

Verified: the same YAML produces identical results via both slurm.py and launch.py: MMLU 0.736 on OCI-HSG and 0.719 locally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (7)
launcher/services/megatron-lm/quantize/quantize.sh (2)

46-47: exit_handler is called with an argument it doesn't accept.

exit_handler in service_utils.sh takes no parameters, but it's called with $0. This is harmless but inconsistent with its definition.

🔧 Suggested fix
 # This function handles the exit status (fails the CI).
-exit_handler $0
+exit_handler
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/services/megatron-lm/quantize/quantize.sh` around lines 46 - 47, The
call to exit_handler in quantize.sh passes an unused argument ($0) even though
exit_handler (defined in service_utils.sh) accepts no parameters; update the
call in quantize.sh to invoke exit_handler with no arguments (replace
"exit_handler $0" with "exit_handler") so it matches the function signature and
removes the inconsistency while keeping behavior unchanged.

33-36: Remove or document unused CONVERT_EXE and EXPORT_EXE.

These variables are defined but never used. If they're placeholders for future workflow steps, add a comment; otherwise remove them to reduce confusion.

🔧 Suggested fix
 QUANTIZE_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/quantize.sh"
 MMLU_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/mmlu.sh"
-CONVERT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/convert.sh"
-EXPORT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/export.sh"
+# TODO: Add convert and export steps
+# CONVERT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/convert.sh"
+# EXPORT_EXE="bash modules/Megatron-LM/examples/post_training/modelopt/export.sh"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/services/megatron-lm/quantize/quantize.sh` around lines 33 - 36,
CONVERT_EXE and EXPORT_EXE are defined but never used; either remove their
definitions or document them as placeholders. Edit the file to either delete the
lines defining CONVERT_EXE and EXPORT_EXE, or add a brief comment above them
stating they are intentional placeholders for future steps (e.g., "placeholder
for future convert/export scripts") and keep the variables if needed; leave
QUANTIZE_EXE and MMLU_EXE unchanged. Ensure the chosen change keeps the script
consistent (no unused vars unless explicitly documented).
launcher/services/service_utils.sh (2)

24-25: Remove or use the FAIL variable.

FAIL is set in error_handler (line 34) but never read—only FAIL_EXIT is used. Either remove FAIL or export it if external scripts rely on it.

🔧 Suggested fix
-FAIL=0
 FAIL_EXIT=0

Or if needed externally:

-FAIL=0
+export FAIL=0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/services/service_utils.sh` around lines 24 - 25, The variable FAIL
is assigned but never read; update the code path around the error_handler and
variable declarations to either remove the unused FAIL variable or
export/consume it where intended: if other scripts rely on it, export FAIL
(e.g., export FAIL) and ensure error_handler sets it; otherwise delete FAIL and
only keep FAIL_EXIT. Locate the declarations FAIL and FAIL_EXIT and the
error_handler function to make the change.

50-54: Pin the diskcache version for reproducibility.

Installing diskcache without a version specifier can lead to non-deterministic builds and potential compatibility issues.

🔧 Suggested fix
 function util_install_extra_dep {
     if [[ "$mpi_local_rank" -eq 0 ]]; then
-        pip install diskcache
+        pip install 'diskcache>=5.6,<6.0'
     fi
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/services/service_utils.sh` around lines 50 - 54, The pip install
call in function util_install_extra_dep currently installs diskcache without a
version; update util_install_extra_dep to pin diskcache to a specific, tested
version (e.g., replace the pip install diskcache invocation with a pinned
install like pip install diskcache==<VERSION>) or read the version from a single
constant/variable (e.g., DISKCACHE_VERSION) so builds are deterministic; keep
the mpi_local_rank guard unchanged.
launcher/pyproject.toml (1)

6-9: Consider pinning pyyaml version for reproducibility.

nemo-run is pinned to a specific commit hash, but pyyaml is unpinned. For consistent, reproducible builds, consider pinning to a specific version (e.g., pyyaml>=6.0,<7.0 or an exact version).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/pyproject.toml` around lines 6 - 9, The dependency list in
pyproject.toml pins "nemo-run" to a commit but leaves "pyyaml" unpinned, risking
non-reproducible builds; update the dependencies array to pin "pyyaml" to a
specific range or exact version (for example change the "pyyaml" entry in the
dependencies list to something like "pyyaml>=6.0,<7.0" or an exact
"pyyaml==6.0.1") so that the dependencies block in pyproject.toml
deterministically resolves.
launcher/slurm_config.py (2)

58-61: Mutable default arguments in function signature.

container_mounts and srun_args use list literals as default values. In Python, mutable defaults are evaluated once at function definition time and shared across all calls, which can lead to unexpected behavior if the lists are modified.

🔧 Suggested fix
 def slurm_factory(
     host: str = os.environ.get("SLURM_HOST", ""),
     account: str = os.environ.get("SLURM_ACCOUNT", ""),
     partition: str = "batch",
     nodes: int = 1,
     ntasks_per_node: int = 1,
     gpus_per_node: int = 1,
     container: str = "nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5",
     modelopt_install_path: str = "/usr/local/lib/python3.12/dist-packages/modelopt",
-    container_mounts: list[str] = [
-        "{}:/hf-local".format(os.environ.get("SLURM_HF_LOCAL", "/hf-local")),
-    ],
-    srun_args: list[str] = ["--no-container-mount-home"],
+    container_mounts: list[str] | None = None,
+    srun_args: list[str] | None = None,
     array: str = None,  # noqa: RUF013
 ) -> SlurmConfig:
     """Generic Slurm factory — configure via environment variables or CLI overrides."""
+    if container_mounts is None:
+        container_mounts = [
+            "{}:/hf-local".format(os.environ.get("SLURM_HF_LOCAL", "/hf-local")),
+        ]
+    if srun_args is None:
+        srun_args = ["--no-container-mount-home"]
     return SlurmConfig(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/slurm_config.py` around lines 58 - 61, The defaults for
container_mounts and srun_args are currently mutable list literals; change their
default values to None in the function or constructor signature (e.g.,
container_mounts: Optional[list[str]] = None, srun_args: Optional[list[str]] =
None) and inside the function initialize them to fresh lists if None (create the
container_mounts list using
"{}:/hf-local".format(os.environ.get("SLURM_HF_LOCAL", "/hf-local")) and set
srun_args to ["--no-container-mount-home"]) so each call gets its own list and
avoids shared mutable state; update any references to these parameters
accordingly (look for container_mounts and srun_args in the function/class where
they are defined).

32-44: Type hints don't match default values.

Several fields have type hints (e.g., str, list[str]) but default to None. Consider using union types for clarity:

🔧 Suggested fix
 @dataclass
 class SlurmConfig:
-    host: str = None
+    host: str | None = None
     port: int = 22
-    account: str = None
+    account: str | None = None
     partition: str = "batch"
-    container: str = None
+    container: str | None = None
     modelopt_install_path: str = "/usr/local/lib/python3.12/dist-packages/modelopt"
-    container_mounts: list[str] = None
-    srun_args: list[str] = None
-    array: str = None
+    container_mounts: list[str] | None = None
+    srun_args: list[str] | None = None
+    array: str | None = None
     nodes: int = 1
     ntasks_per_node: int = 1
     gpus_per_node: int = 1
     local: bool = False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/slurm_config.py` around lines 32 - 44, The annotated types for
several dataclass/config fields (host, account, container, container_mounts,
srun_args, array) conflict with their None defaults; update their type hints to
allow None (e.g., use Optional[str] for host/account/container/array and
Optional[list[str]] for container_mounts/srun_args) and add the necessary import
from typing (Optional) or use union syntax (str | None, list[str] | None) so the
annotation matches the default values; keep the existing defaults (None) and
leave other fields (port, nodes, ntasks_per_node, gpus_per_node,
modelopt_install_path, local, partition) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@launcher/core.py`:
- Around line 313-329: Remove the inline "# nosec" comments: delete the "# nosec
B404" on the "import subprocess" line and the "# nosec B603 B607" comments on
the subprocess.run calls that set the commit and branch variables (the lines
invoking git rev-parse for commit and branch). Keep the subprocess calls as-is
(list args, no shell=True, capture_output/text/timeout) and instead add to your
PR description or a nearby code comment a request for formal review by
`@NVIDIA/modelopt-setup-codeowners` with documented justification for why these
subprocess usages are safe.

In `@launcher/services/megatron-lm/quantize/quantize.sh`:
- Line 38: The assignment to MLM_EXTRA_ARGS incorrectly expands "${@}" into a
scalar; change it to either capture all args as a single string using "$*" (if
you want one string) or make MLM_EXTRA_ARGS an array by declaring
MLM_EXTRA_ARGS=("$@") (if you need separate elements), and update subsequent
uses of MLM_EXTRA_ARGS accordingly so they treat it as a string or an array as
intended.

In `@launcher/services/service_utils.sh`:
- Around line 59-62: Replace the runtime append of __version__ from the script:
stop modifying modules/Model-Optimizer/modelopt/__init__.py at runtime and
instead write/read a dedicated version file (e.g.,
modules/Model-Optimizer/VERSION) or inject the version at build-time; remove the
block that checks mpi_local_rank and echoes "__version__ = '1.0.0'"; implement a
safe creation/update step that checks for an existing VERSION file (or existing
__version__ entry) before writing and ensure the write happens in a
build/pre-deploy step or by a single global coordinator (not per-node
mpi_local_rank==0) to avoid race conditions and git dirty state.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 56d56135-8c67-4a05-b578-e10120a5789f

📥 Commits

Reviewing files that changed from the base of the PR and between d28acd3 and f3d3020.

📒 Files selected for processing (8)
  • launcher/Qwen/Qwen3-8B/megatron_lm_ptq.yaml
  • launcher/README.md
  • launcher/core.py
  • launcher/launch.py
  • launcher/pyproject.toml
  • launcher/services/megatron-lm/quantize/quantize.sh
  • launcher/services/service_utils.sh
  • launcher/slurm_config.py
✅ Files skipped from review due to trivial changes (1)
  • launcher/README.md

Comment on lines +313 to +329
import subprocess  # nosec B404

try:
    commit = subprocess.run(  # nosec B603 B607
        ["git", "rev-parse", "--short", "HEAD"],
        cwd=path,
        capture_output=True,
        text=True,
        timeout=5,
    ).stdout.strip()
    branch = subprocess.run(  # nosec B603 B607
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        cwd=path,
        capture_output=True,
        text=True,
        timeout=5,
    ).stdout.strip()
Contributor

⚠️ Potential issue | 🟠 Major

Remove # nosec comments per coding guidelines.

The coding guidelines state: "# nosec comments are not allowed as bypasses for security checks; contributors must request formal review from @NVIDIA/modelopt-setup-codeowners with documented justification instead."

The subprocess calls here are safe (list arguments, no shell=True, hardcoded commands), but the bypass comments should be removed and the pattern formally reviewed.

🔧 Suggested fix
 def _git_info(path):
     """Get git commit hash and branch for a directory."""
-    import subprocess  # nosec B404
+    import subprocess

     try:
-        commit = subprocess.run(  # nosec B603 B607
+        commit = subprocess.run(
             ["git", "rev-parse", "--short", "HEAD"],
             cwd=path,
             capture_output=True,
             text=True,
             timeout=5,
         ).stdout.strip()
-        branch = subprocess.run(  # nosec B603 B607
+        branch = subprocess.run(
             ["git", "rev-parse", "--abbrev-ref", "HEAD"],
             cwd=path,
             capture_output=True,
             text=True,
             timeout=5,
         ).stdout.strip()

As per coding guidelines: "Disallow use of '# nosec' comments to bypass Bandit security checks."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/core.py` around lines 313 - 329, Remove the inline "# nosec"
comments: delete the "# nosec B404" on the "import subprocess" line and the "#
nosec B603 B607" comments on the subprocess.run calls that set the commit and
branch variables (the lines invoking git rev-parse for commit and branch). Keep
the subprocess calls as-is (list args, no shell=True,
capture_output/text/timeout) and instead add to your PR description or a nearby
code comment a request for formal review by `@NVIDIA/modelopt-setup-codeowners`
with documented justification for why these subprocess usages are safe.

Comment on lines +59 to +62
# Increase the modelopt version number manually
if [[ "$mpi_local_rank" -eq 0 ]]; then
    echo "__version__ = '1.0.0'" >> ./modules/Model-Optimizer/modelopt/__init__.py
fi
Contributor

⚠️ Potential issue | 🟡 Minor

Runtime modification of __init__.py is fragile and may cause race conditions.

Appending __version__ to the modelopt __init__.py at runtime:

  1. Modifies source files within a submodule, which can cause git dirty state
  2. May race with other MPI ranks despite the mpi_local_rank == 0 guard (multiple nodes each have a local rank 0)
  3. Doesn't check if __version__ already exists, causing duplicate entries on repeated runs

Consider moving version management to build-time or using a separate version file that's explicitly managed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/services/service_utils.sh` around lines 59 - 62, Replace the runtime
append of __version__ from the script: stop modifying
modules/Model-Optimizer/modelopt/__init__.py at runtime and instead write/read a
dedicated version file (e.g., modules/Model-Optimizer/VERSION) or inject the
version at build-time; remove the block that checks mpi_local_rank and echoes
"__version__ = '1.0.0'"; implement a safe creation/update step that checks for
an existing VERSION file (or existing __version__ entry) before writing and
ensure the write happens in a build/pre-deploy step or by a single global
coordinator (not per-node mpi_local_rank==0) to avoid race conditions and git
dirty state.

Collaborator

Thoughts on making top-level folder name generic like tools inside which we have launcher?

Collaborator Author

Didn't quite get it. Could you be more clear?

@@ -0,0 +1,13 @@
script: services/megatron-lm/quantize/quantize.sh
args:
- --calib-dataset-path-or-name /hf-local/abisee/cnn_dailymail
Collaborator

Can we mention in README what hf-local is? Or for simplicity we can just make it HF path abisee/cnn_dailymail

Collaborator Author

Done.


# Increase the modelopt version number manually
if [[ "$mpi_local_rank" -eq 0 ]]; then
echo "__version__ = '1.0.0'" >> ./modules/Model-Optimizer/modelopt/__init__.py
Collaborator

why do we need to modify ModelOpt version number?

Collaborator Author

This is a WAR (workaround) since we are mounting instead of pip installing, so the version will remain the same, which will fail a lot of version checks in modelopt.torch. Any suggestions?

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (4)
launcher/core.py (4)

432-455: ⚠️ Potential issue | 🟠 Major

Preserve task-level env overrides when applying defaults.

On Line 443 and Line 455, defaults overwrite task-specific values. Reverse the merge precedence so YAML task env wins.

Suggested change
-                    task_env.update(default_local_env)
+                    task_env = {**default_local_env, **task_env}
@@
-                    task_env.update(default_slurm_env)
+                    task_env = {**default_slurm_env, **task_env}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/core.py` around lines 432 - 455, The current merges call
task_env.update(default_local_env) / task_env.update(default_slurm_env) which
lets defaults overwrite task-specific YAML values; instead, apply defaults first
then layer task-level env on top so task YAML wins—e.g., obtain the appropriate
default dict for the branch (for docker: default_local_env; for slurm:
default_slurm_env), copy/merge defaults first and then update with task_env (or
update a copy of defaults with task_env) before passing to build_docker_executor
/ build_slurm_executor; target the hf_local branch and the calls around
build_docker_executor, build_slurm_executor and the task_env.update(...) lines.

482-485: ⚠️ Potential issue | 🟠 Major

Write metadata under the configured NeMo Run home, not CWD-relative.

metadata.json is currently written to a relative path on Line 482, which can diverge from the configured NEMORUN_HOME set by the launcher.

Suggested change
-def run_jobs(
+def run_jobs(
     job_table,
@@
     modelopt_src_path=None,
     base_dir=None,
+    nemorun_home=None,
 ):
@@
-        metadata_path = os.path.join("experiments", experiment_title, exp._id, "metadata.json")
+        root = nemorun_home or os.environ.get("NEMORUN_HOME", os.getcwd())
+        metadata_path = os.path.join(root, "experiments", experiment_title, exp._id, "metadata.json")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/core.py` around lines 482 - 485, The code writes metadata.json to a
CWD-relative path using metadata_path built from "experiments", experiment_title
and exp._id; change it to write under the configured NeMo Run home instead—build
metadata_path by joining the launcher’s configured NeMo Run home variable (e.g.,
NEMORUN_HOME or the instance/home attribute used by the launcher) with
"experiments", experiment_title and exp._id, ensure os.makedirs uses
os.path.dirname(metadata_path) and then json.dump metadata to that path so files
are created under the configured NeMo Run home rather than the current working
directory.

436-450: ⚠️ Potential issue | 🟠 Major

Avoid relying on run.Experiment._id private API.

Using _id couples this code to an internal nemo_run attribute and risks breakage on dependency updates.

In the currently used nemo_run version, what is the supported public API (if any) for retrieving an Experiment identifier after creation, and is `_id` considered internal-only?

Also applies to: 477-482

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/core.py` around lines 436 - 450, Replace uses of the private
attribute run.Experiment._id with the official public Experiment identifier
accessor from the nemo_run API (for example exp.id or exp.get_id(), whichever
the library exposes) everywhere it's passed (e.g., to build_slurm_executor and
the local executor call sites referenced around build_slurm_executor and the
earlier local executor block). Locate all occurrences of exp._id (including the
other block noted) and switch them to the public accessor; if the library
requires calling a method at creation time, capture and reuse that returned id
instead of reading a private field. Ensure the code compiles and tests that rely
on the experiment id are updated to use the new accessor.

321-332: ⚠️ Potential issue | 🟠 Major

Remove Bandit bypass comments from subprocess calls.

# nosec on Line 321, Line 324, and Line 331 violates repo policy even when the subprocess usage is otherwise safe.

Suggested change
-    import subprocess  # nosec B404
+    import subprocess
@@
-        commit = subprocess.run(  # nosec B603 B607
+        commit = subprocess.run(
@@
-        branch = subprocess.run(  # nosec B603 B607
+        branch = subprocess.run(

As per coding guidelines, "Disallow use of '# nosec' comments to bypass Bandit security checks."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/core.py` around lines 321 - 332, Remove the "# nosec" comments on
the subprocess import and the subprocess.run calls (the lines that set commit
and branch), and instead make the calls explicitly safe: keep the argument lists
(["git", "rev-parse", ...]) to avoid shell=True, add check=True to both
subprocess.run calls and handle subprocess.CalledProcessError (or
subprocess.SubprocessError) to log/raise a clear error on failure, and
optionally validate git exists with shutil.which("git") before invoking; update
the code around variables commit and branch to use this checked/run-and-handle
pattern so Bandit suppression comments are no longer needed.
🧹 Nitpick comments (1)
launcher/core.py (1)

133-135: Fail fast with a clear error for unknown/missing YAML factory names.

Raw KeyError from factory_lookup[function_name] is hard to diagnose for config authors.

Suggested change
-    function_name = config_from_yaml["slurm_config"].pop("_factory_")
-    slurm_config = factory_lookup[function_name](**config_from_yaml["slurm_config"])
+    function_name = config_from_yaml["slurm_config"].pop("_factory_", None)
+    if not function_name:
+        raise ValueError(f"Missing '_factory_' in slurm_config for YAML: {yaml_file}")
+    try:
+        factory = factory_lookup[function_name]
+    except KeyError as exc:
+        raise ValueError(f"Unsupported slurm_config factory '{function_name}' in {yaml_file}") from exc
+    slurm_config = factory(**config_from_yaml["slurm_config"])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/core.py` around lines 133 - 135, The code currently assumes
config_from_yaml["slurm_config"]["_factory_"] exists and that
factory_lookup[function_name] will succeed, which yields an opaque KeyError for
config authors; update the block around function_name, factory_lookup, and
slurm_config to explicitly check for the "_factory_" key on
config_from_yaml["slurm_config"] and that function_name is present in
factory_lookup, and if not raise a clear ValueError (or custom exception) that
states the missing or unknown factory name and lists available keys from
factory_lookup so the user can correct their YAML; keep subsequent behavior of
calling factory_lookup[function_name](**config_from_yaml["slurm_config"]) and
preserving args = config_from_yaml.get("args", None).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@launcher/core.py`:
- Around line 432-455: The current merges call
task_env.update(default_local_env) / task_env.update(default_slurm_env) which
lets defaults overwrite task-specific YAML values; instead, apply defaults first
then layer task-level env on top so task YAML wins—e.g., obtain the appropriate
default dict for the branch (for docker: default_local_env; for slurm:
default_slurm_env), copy/merge defaults first and then update with task_env (or
update a copy of defaults with task_env) before passing to build_docker_executor
/ build_slurm_executor; target the hf_local branch and the calls around
build_docker_executor, build_slurm_executor and the task_env.update(...) lines.
- Around line 482-485: The code writes metadata.json to a CWD-relative path
using metadata_path built from "experiments", experiment_title and exp._id;
change it to write under the configured NeMo Run home instead—build
metadata_path by joining the launcher’s configured NeMo Run home variable (e.g.,
NEMORUN_HOME or the instance/home attribute used by the launcher) with
"experiments", experiment_title and exp._id, ensure os.makedirs uses
os.path.dirname(metadata_path) and then json.dump metadata to that path so files
are created under the configured NeMo Run home rather than the current working
directory.
- Around line 436-450: Replace uses of the private attribute run.Experiment._id
with the official public Experiment identifier accessor from the nemo_run API
(for example exp.id or exp.get_id(), whichever the library exposes) everywhere
it's passed (e.g., to build_slurm_executor and the local executor call sites
referenced around build_slurm_executor and the earlier local executor block).
Locate all occurrences of exp._id (including the other block noted) and switch
them to the public accessor; if the library requires calling a method at
creation time, capture and reuse that returned id instead of reading a private
field. Ensure the code compiles and tests that rely on the experiment id are
updated to use the new accessor.
- Around line 321-332: Remove the "# nosec" comments on the subprocess import
and the subprocess.run calls (the lines that set commit and branch), and instead
make the calls explicitly safe: keep the argument lists (["git", "rev-parse",
...]) to avoid shell=True, add check=True to both subprocess.run calls and
handle subprocess.CalledProcessError (or subprocess.SubprocessError) to
log/raise a clear error on failure, and optionally validate git exists with
shutil.which("git") before invoking; update the code around variables commit and
branch to use this checked/run-and-handle pattern so Bandit suppression comments
are no longer needed.

---

Nitpick comments:
In `@launcher/core.py`:
- Around line 133-135: The code currently assumes
config_from_yaml["slurm_config"]["_factory_"] exists and that
factory_lookup[function_name] will succeed, which yields an opaque KeyError for
config authors; update the block around function_name, factory_lookup, and
slurm_config to explicitly check for the "_factory_" key on
config_from_yaml["slurm_config"] and that function_name is present in
factory_lookup, and if not raise a clear ValueError (or custom exception) that
states the missing or unknown factory name and lists available keys from
factory_lookup so the user can correct their YAML; keep subsequent behavior of
calling factory_lookup[function_name](**config_from_yaml["slurm_config"]) and
preserving args = config_from_yaml.get("args", None).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ce5b1f59-c0be-4a83-9853-3cff952f9271

📥 Commits

Reviewing files that changed from the base of the PR and between f3d3020 and f7f9878.

📒 Files selected for processing (2)
  • launcher/core.py
  • launcher/launch.py

launch.py now only accepts pipeline=@ or --yaml. Update README with
--yaml vs pipeline=@ docs, useful flags, and bug reporting. Update
Qwen3-8B config to new --yaml format with job_name + pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (2)
launcher/launch.py (2)

50-50: ⚠️ Potential issue | 🟠 Major

Default HF_HOME still appears inconsistent with /hf-local mounts.

get_default_env(EXPERIMENT_TITLE) yields HF_HOME=/<title>/hf-cache, while the Slurm factory mount target is /hf-local. That means cache reuse can be missed unless every task overrides HF_HOME.

🔧 Proposed fix
 DEFAULT_SLURM_ENV, DEFAULT_LOCAL_ENV = get_default_env(EXPERIMENT_TITLE)
+DEFAULT_SLURM_ENV["HF_HOME"] = "/hf-local"
+DEFAULT_LOCAL_ENV["HF_HOME"] = "/hf-local"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/launch.py` at line 50, The default HF_HOME produced by
get_default_env(EXPERIMENT_TITLE) is set to /<title>/hf-cache which does not
match the Slurm factory mount target /hf-local, causing cache reuse to be
missed; update get_default_env (or the DEFAULT_SLURM_ENV/DEFAULT_LOCAL_ENV
initialization) to set HF_HOME to the mounted path /hf-local (or derive HF_HOME
from the Slurm mount constant) so that DEFAULT_SLURM_ENV and DEFAULT_LOCAL_ENV
use the same HF_HOME as the Slurm mounts and tasks inherit the consistent cache
location.

52-62: ⚠️ Potential issue | 🟠 Major

services/* packaging may resolve from the wrong base directory.

include_pattern includes services/*, but relative_path is fully anchored to LAUNCHER_DIR. If services/ is at repo root, this silently excludes it from packaging.

🔧 Proposed fix
 packager = run.PatternPackager(
@@
-    relative_path=[LAUNCHER_DIR] * 6,
+    relative_path=[LAUNCHER_DIR] * 5 + [MODELOPT_ROOT],
 )
#!/bin/bash
set -euo pipefail

echo "== services directories in repository =="
fd -HI -td '^services$' .

echo
echo "== Does launcher/services exist? =="
if [ -d launcher/services ]; then
  echo "launcher/services exists"
else
  echo "launcher/services missing"
fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/launch.py` around lines 52 - 62, The packaging include_pattern lists
"services/*" while relative_path is set to [LAUNCHER_DIR] * 6, which will
exclude repo-root services if they live outside LAUNCHER_DIR; update the
run.PatternPackager invocation so the relative_path entry that pairs with the
"services/*" pattern points to the correct base (e.g. repo root) instead of
LAUNCHER_DIR — locate the run.PatternPackager call and change the corresponding
element in relative_path (or compute relative_path dynamically) so the
"services/*" pattern resolves from the actual services directory.
🧹 Nitpick comments (2)
launcher/launch.py (2)

74-76: Avoid unconditionally overriding caller-provided job_dir in local mode.

Line 87-Line 89 always rewrites job_dir when hf_local is set, which drops explicit user input.

🔧 Proposed refactor
 def launch(
     job_name: str = "01_job",
-    job_dir: str = os.environ.get("SLURM_JOB_DIR", os.path.expanduser("~/experiments")),
+    job_dir: str | None = None,
@@
 ) -> None:
@@
-    if hf_local is not None:
-        job_dir = os.path.join(os.getcwd(), "local_experiments")
+    if hf_local is not None:
+        job_dir = job_dir or os.path.join(os.getcwd(), "local_experiments")
+    else:
+        job_dir = job_dir or os.environ.get("SLURM_JOB_DIR", os.path.expanduser("~/experiments"))

Also applies to: 87-89

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/launch.py` around lines 74 - 76, The current behavior
unconditionally replaces the caller-provided job_dir when hf_local is true;
change this so job_dir is only auto-set when the caller didn't provide one: make
job_dir default to None in the function signature (job_dir: Optional[str] =
None) and initialize it from os.environ.get("SLURM_JOB_DIR",
os.path.expanduser("~/experiments")) only if job_dir is None, and likewise in
the hf_local branch update job_dir only when job_dir is None or empty; update
references to the job_dir variable accordingly (look for the job_dir parameter
and the hf_local handling block).

36-37: Consider using absolute imports for better package reusability.

Lines 36-37 use relative imports (core, slurm_config), which work for the documented script usage (uv run launch.py), but are fragile if the code is ever imported as a module or if entry points are added to pyproject.toml. Since launcher is a proper Python package with package infrastructure, absolute imports with fallback improve maintainability.

The proposed try-except approach is reasonable:

🔧 Suggested approach
-from core import SandboxPipeline, get_default_env, register_factory, run_jobs, set_slurm_config_type
-from slurm_config import SlurmConfig, slurm_factory
+try:
+    from launcher.core import (
+        SandboxPipeline,
+        get_default_env,
+        register_factory,
+        run_jobs,
+        set_slurm_config_type,
+    )
+    from launcher.slurm_config import SlurmConfig, slurm_factory
+except ImportError:
+    # Script-mode fallback (e.g., running from launcher/ directly)
+    from core import SandboxPipeline, get_default_env, register_factory, run_jobs, set_slurm_config_type
+    from slurm_config import SlurmConfig, slurm_factory
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/launch.py` around lines 36 - 37, Replace the fragile relative
imports in launch.py with absolute-package imports and a fallback to the
relative names: attempt to import SandboxPipeline, get_default_env,
register_factory, run_jobs, set_slurm_config_type from the top-level package
(e.g., launcher.core or package.core) and SlurmConfig, slurm_factory from the
package slurm_config, and if that fails in an ImportError, fall back to the
current relative imports; locate the import line that currently references core
and slurm_config and update it to a try/except ImportError that assigns the same
symbols (SandboxPipeline, get_default_env, register_factory, run_jobs,
set_slurm_config_type, SlurmConfig, slurm_factory) so callers elsewhere in the
file keep working regardless of whether the module is run as a script or
imported as a package.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@launcher/launch.py`:
- Line 50: The default HF_HOME produced by get_default_env(EXPERIMENT_TITLE) is
set to /<title>/hf-cache which does not match the Slurm factory mount target
/hf-local, causing cache reuse to be missed; update get_default_env (or the
DEFAULT_SLURM_ENV/DEFAULT_LOCAL_ENV initialization) to set HF_HOME to the
mounted path /hf-local (or derive HF_HOME from the Slurm mount constant) so that
DEFAULT_SLURM_ENV and DEFAULT_LOCAL_ENV use the same HF_HOME as the Slurm mounts
and tasks inherit the consistent cache location.
- Around line 52-62: The packaging include_pattern lists "services/*" while
relative_path is set to [LAUNCHER_DIR] * 6, which will exclude repo-root
services if they live outside LAUNCHER_DIR; update the run.PatternPackager
invocation so the relative_path entry that pairs with the "services/*" pattern
points to the correct base (e.g. repo root) instead of LAUNCHER_DIR — locate the
run.PatternPackager call and change the corresponding element in relative_path
(or compute relative_path dynamically) so the "services/*" pattern resolves from
the actual services directory.

---

Nitpick comments:
In `@launcher/launch.py`:
- Around line 74-76: The current behavior unconditionally replaces the
caller-provided job_dir when hf_local is true; change this so job_dir is only
auto-set when the caller didn't provide one: make job_dir default to None in the
function signature (job_dir: Optional[str] = None) and initialize it from
os.environ.get("SLURM_JOB_DIR", os.path.expanduser("~/experiments")) only if
job_dir is None, and likewise in the hf_local branch update job_dir only when
job_dir is None or empty; update references to the job_dir variable accordingly
(look for the job_dir parameter and the hf_local handling block).
- Around line 36-37: Replace the fragile relative imports in launch.py with
absolute-package imports and a fallback to the relative names: attempt to import
SandboxPipeline, get_default_env, register_factory, run_jobs,
set_slurm_config_type from the top-level package (e.g., launcher.core or
package.core) and SlurmConfig, slurm_factory from the package slurm_config, and
if that fails in an ImportError, fall back to the current relative imports;
locate the import line that currently references core and slurm_config and
update it to a try/except ImportError that assigns the same symbols
(SandboxPipeline, get_default_env, register_factory, run_jobs,
set_slurm_config_type, SlurmConfig, slurm_factory) so callers elsewhere in the
file keep working regardless of whether the module is run as a script or
imported as a package.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 07f18aab-eab7-4f3b-812c-8122649ca802

📥 Commits

Reviewing files that changed from the base of the PR and between f7f9878 and ad1f0d8.

📒 Files selected for processing (3)
  • launcher/Qwen/Qwen3-8B/megatron_lm_ptq.yaml
  • launcher/README.md
  • launcher/launch.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • launcher/README.md
  • launcher/Qwen/Qwen3-8B/megatron_lm_ptq.yaml

Move service scripts to common/ (query.py, query.sh, eagle3, specdec_bench,
megatron-lm quantize). Add Qwen3-8B EAGLE3 offline pipeline YAML. Add
ADVANCED.md with architecture docs and Claude Code workflows. Update
packager to include common/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 15

♻️ Duplicate comments (2)
launcher/common/megatron-lm/quantize/quantize.sh (1)

38-38: ⚠️ Potential issue | 🟡 Minor

Preserve CLI args correctly when exporting MLM_EXTRA_ARGS.

Line 38 assigns ${@} to a scalar, which collapses/splits arguments unexpectedly. Use "$*" for a single string export (or a true array if downstream expects elements).

🔧 Proposed fix
-export MLM_EXTRA_ARGS=${@}
+export MLM_EXTRA_ARGS="$*"
#!/bin/bash
set -euo pipefail
# Verify all MLM_EXTRA_ARGS assignments/usages in shell scripts.
rg -n -C2 '\bMLM_EXTRA_ARGS\b' --type=sh
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/common/megatron-lm/quantize/quantize.sh` at line 38, The export of
MLM_EXTRA_ARGS collapses positional parameters because it uses ${@}; change it
to preserve all CLI args as a single exported string by using quoted expansion:
export MLM_EXTRA_ARGS="$*" (or, if downstream expects separate elements, convert
to a true bash array with declare -a MLM_EXTRA_ARGS=( "$@" ) and adapt callers
accordingly); update any callers that assume an array to handle the chosen
representation.
launcher/common/service_utils.sh (1)

59-62: ⚠️ Potential issue | 🟠 Major

Avoid mutating submodule source files at runtime for versioning.

Line 61 appends to modules/Model-Optimizer/modelopt/__init__.py on each run, which can create duplicate __version__ entries, race across multi-node jobs, and dirty the working tree.

#!/bin/bash
set -euo pipefail
# Inspect the runtime append logic and current __version__ declarations.
sed -n '56,62p' launcher/common/service_utils.sh
rg -n '^__version__\s*=' modules/Model-Optimizer/modelopt/__init__.py
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/common/service_utils.sh` around lines 59 - 62, The current logic in
service_utils.sh unconditionally appends "__version__ = '1.0.0'" when
mpi_local_rank == 0 which mutates the submodule source repeatedly; change this
to be idempotent or avoid editing the submodule: in the block that checks
mpi_local_rank, first test the target file for an existing __version__
declaration (grep -q '^__version__' ...) and if present use an in-place replace
(sed -i -E "s/^__version__.*/__version__ = '1.0.0'/") otherwise write the single
line; alternatively stop modifying the submodule entirely by exporting a
MODELOPT_VERSION env var or writing the version into a generated file outside
the submodule (instead of echoing into the submodule file). Ensure the
check/replace runs only when mpi_local_rank == 0 to preserve the single-writer
invariant.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@launcher/common/eagle3/dump_offline_data.sh`:
- Around line 38-42: The command invocation in dump_offline_data.sh uses an
unquoted array expansion ${@}, which allows word splitting and globbing; update
the trtllm-llmapi-launch invocation to use the quoted expansion "$@" so argument
boundaries and special characters are preserved (locate the line invoking
trtllm-llmapi-launch in dump_offline_data.sh and replace ${@} with "$@").

In `@launcher/common/eagle3/offline_training.sh`:
- Line 19: The source path is wrong: in offline_training.sh the variable
SCRIPT_DIR points to launcher/common/eagle3/, but service_utils.sh lives one
level up; update the sourcing in offline_training.sh so it references the
correct location (e.g., resolve SCRIPT_DIR/../service_utils.sh or compute the
parent dir via dirname and then source service_utils.sh) — modify the source
line that currently uses SCRIPT_DIR to point to the parent directory where
service_utils.sh actually resides.
- Around line 39-40: Re-enable the final exit handler call so CI sees failures:
restore the commented-out call to exit_handler (call exit_handler $0) at the end
of offline_training.sh so the ERR trap's FAIL_EXIT flag (set earlier by the
trap) is evaluated and the script exits nonzero when failures occur; ensure the
existing ERR trap and FAIL_EXIT behavior from launcher/common/service_utils.sh
remain unchanged and that exit_handler is invoked with the script's $0 name.
- Around line 29-31: The script forwards user args to the downstream launcher
using an unquoted ${@}, which can break arguments with spaces; update the
invocation of launch_train.sh (the line calling bash
modules/Model-Optimizer/examples/speculative_decoding/launch_train.sh and
passing --model ${HF_MODEL_CKPT} ${@}) to pass user arguments as "$@" (and also
ensure HF_MODEL_CKPT is quoted if it can contain spaces) so all positional
parameters are preserved intact when invoking launch_train.sh.
- Line 22: The pip install call in offline_training.sh uses an unquoted
requirement specifier (huggingface-hub>=1.2.1) which can be misinterpreted by
the shell; update the pip install invocation to pass the requirement specifier
as a quoted string (e.g., wrap huggingface-hub>=1.2.1 in single or double
quotes) so the >= is preserved and pip receives the correct requirement.
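The last finding above is worth seeing concretely: an unquoted `>=` is parsed by the shell as a redirection, so pip never receives the version constraint. The sketch below substitutes a `fake_pip` function for the real pip to show exactly what arguments arrive:

```shell
#!/bin/bash
# Sketch: what the shell does with an unquoted '>=' requirement specifier.
set -u
cd "$(mktemp -d)"

# Stand-in for pip that just echoes the arguments it received.
fake_pip() { echo "args: $*"; }

fake_pip install huggingface-hub>=1.2.1    # '>' redirects output to a file named '=1.2.1'
cat ./=1.2.1                               # pip-side view: the constraint is gone

fake_pip install 'huggingface-hub>=1.2.1'  # quoted: the full requirement reaches pip
```

The unquoted call leaves a stray `=1.2.1` file behind and silently installs an unconstrained package; quoting the specifier fixes both.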

In `@launcher/common/megatron-lm/quantize/quantize.sh`:
- Around line 35-36: The script defines CONVERT_EXE and EXPORT_EXE but never
runs them; update the quantize.sh flow to execute these stages (e.g., run
"$CONVERT_EXE" then "$EXPORT_EXE") at the appropriate point after prerequisites
(such as after quantization completes), check their exit codes and abort on
failure, and ensure the variables are executable/invoked with bash if needed;
reference the CONVERT_EXE and EXPORT_EXE variables and place the calls in the
main execution sequence (same area where other stage commands are invoked) so
the multi-stage pipeline actually runs.
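A minimal shape for wiring those stages into the main sequence with per-stage exit-code checks; the stage commands below are placeholders (`true`), not the real quantize/convert/export invocations:

```shell
#!/bin/bash
# Sketch: run pipeline stages in order, aborting on the first failure.
set -euo pipefail

run_stage() {
  local name="$1"; shift
  echo "[stage] ${name}"
  if ! "$@"; then
    echo "[stage] ${name} failed" >&2
    return 1
  fi
}

# Placeholders standing in for the quantize / CONVERT_EXE / EXPORT_EXE commands.
run_stage quantize true
run_stage convert  true
run_stage export   true
echo "pipeline complete"
```

With `set -e` active, any stage returning nonzero stops the script, so the later stages never run against a broken intermediate artifact.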

In `@launcher/common/query.py`:
- Around line 46-52: The except block that sets early_termination to True is
currently creating a local variable, so the module-level flag never changes;
update the except handler in launcher/common/query.py to declare the
module-level variable by adding a global early_termination (or otherwise
reference the module-level symbol) before assigning early_termination = True so
the assignment mutates the shared flag used later in the termination check.
- Around line 102-106: The code mutates msg["content"] in-place which can cause
race conditions when rows are shared (e.g., dataset.map with num_proc>1);
instead create a shallow copy of the message before modifying and append that
copy to current_messages (referencing the variables msg, current_messages and
the enable_thinking branch in launcher/common/query.py) so the original dataset
row is never altered by adding " /no_think".

In `@launcher/common/specdec_bench/quick_check.sh`:
- Around line 24-27: The shell invocation in quick_check.sh uses an unquoted
array expansion `${@}` which allows word splitting and globbing; update the
command that calls `${TRTLLM_LAUNCH_SCRIPT} python3
modules/Model-Optimizer/examples/specdec_bench/run.py --model_dir
${HF_MODEL_CKPT} --tokenizer ${HF_MODEL_CKPT} ${@}` to use the quoted expansion
`"$@"` so that each original argument to the script is preserved exactly (no
splitting or globbing) when passed through.

In `@launcher/common/tensorrt-llm/query.sh`:
- Line 75: The QUERY_ARGS array initialization uses unquoted expansions which
can split words; update the QUERY_ARGS assignment to quote the variable
expansions for TASK_ID and TASK_COUNT so the array elements remain single items
(i.e., use quoted "${TASK_ID}" and "${TASK_COUNT}" when constructing QUERY_ARGS)
— modify the QUERY_ARGS=(--shard-id-begin ${TASK_ID} --shard-id-step
${TASK_COUNT}) line accordingly.
- Around line 113-115: Replace the unsafe eval invocation that builds a single
interpolated string in the cmd variable and runs eval $cmd with a proper command
array and direct execution to avoid shell injection; construct an array (e.g.,
cmd_arr) containing "python", "common/query.py", "http://localhost:8000/v1",
"$MODEL", and the elements of QUERY_ARGS (use "${QUERY_ARGS[@]}") and then run
it with "${cmd_arr[@]}", removing the eval and quoting variables appropriately
so MODEL and QUERY_ARGS entries are passed as individual arguments.
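Building the command as an array and expanding it with `"${cmd_arr[@]}"` keeps every element a single argument with no `eval` step. In this self-contained sketch, `printf` stands in for the real `python common/query.py` invocation:

```shell
#!/bin/bash
# Sketch: replace `cmd="..."; eval $cmd` with a command array.
set -euo pipefail

MODEL="my model with spaces"                      # illustrative value
QUERY_ARGS=(--shard-id-begin "0" --shard-id-step "8")

# printf here stands in for: python common/query.py <url> <model> <args...>
cmd_arr=(printf '%s\n' "http://localhost:8000/v1" "$MODEL" "${QUERY_ARGS[@]}")
"${cmd_arr[@]}"   # each array element arrives as exactly one argument
```

Even though `MODEL` contains spaces, it reaches the command as one argument, which is exactly what the `eval`-based version cannot guarantee.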

In `@launcher/common/vllm/query.sh`:
- Line 81: The array initialization QUERY_ARGS=(--shard-id-begin ${TASK_ID}
--shard-id-step ${TASK_COUNT}) should quote the variable expansions to avoid
word splitting; update the QUERY_ARGS assignment to use "--shard-id-begin"
"${TASK_ID}" and "--shard-id-step" "${TASK_COUNT}" (referencing QUERY_ARGS,
TASK_ID, TASK_COUNT) so each option and its value remain single array elements.
- Around line 121-123: The script builds a single interpolated string in cmd and
runs it with eval, which is unsafe; instead construct and invoke the command
without eval by calling python directly with separate arguments (use the script
name and the URL, $MODEL and the QUERY_ARGS array as distinct parameters) so
word-splitting and injection are avoided—replace the cmd variable + eval usage
in launcher/common/vllm/query.sh with a direct invocation of python
common/query.py passing "http://localhost:8000/v1", $MODEL, and
"${QUERY_ARGS[@]}" (or equivalent positional-argument array expansion) so no
eval is required.

In `@launcher/launch.py`:
- Around line 52-63: The include_pattern passed to run.PatternPackager (variable
packager) contains a non-existent "services/*" entry which will never match
files when resolved against LAUNCHER_DIR; remove that pattern or replace it with
the correct existing directory path (e.g., the actual services folder name if it
was renamed) so the packager includes the intended files. Locate the
run.PatternPackager call (packager variable) in launcher/launch.py and either
delete the "services/*" element from include_pattern or update it to the correct
directory name that exists in the repo, then run the packager to verify matches.

In `@launcher/Qwen/Qwen3-8B/hf_offline_eagle3.yaml`:
- Around line 76-95: task_2 writes the draft model to --output_dir
/scratchspace/eagle3 while task_3 reads --draft_model_dir /scratchspace/export;
update one of them so both point to the same directory (e.g., change task_3's
--draft_model_dir to /scratchspace/eagle3 or change task_2's --output_dir to
/scratchspace/export) so task_3 can consume the draft produced by task_2; ensure
you update the corresponding arg in task_2 (the --output_dir setting) or task_3
(the --draft_model_dir argument) in the YAML.

---

Duplicate comments:
In `@launcher/common/megatron-lm/quantize/quantize.sh`:
- Line 38: The export of MLM_EXTRA_ARGS collapses positional parameters because
it uses ${@}; change it to preserve all CLI args as a single exported string by
using quoted expansion: export MLM_EXTRA_ARGS="$*" (or, if downstream expects
separate elements, convert to a true bash array with declare -a MLM_EXTRA_ARGS=(
"$@" ) and adapt callers accordingly); update any callers that assume an array
to handle the chosen representation.
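The trade-off between the two representations can be seen directly; the argument values below are illustrative:

```shell
#!/bin/bash
# Sketch: "$*" joins all args into one string; an array built from "$@"
# keeps each argument as a separate element.
set -euo pipefail

set -- --flag "value with spaces" --other x   # illustrative CLI args

as_string="$*"                 # exportable single-string form
declare -a as_array=( "$@" )   # true array form

echo "string: ${as_string}"
echo "array length: ${#as_array[@]}"
echo "second element: ${as_array[1]}"
```

The string form survives `export` but loses argument boundaries; the array form keeps boundaries but cannot be exported, so callers must agree on one representation, as the comment notes.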

In `@launcher/common/service_utils.sh`:
- Around line 59-62: The current logic in service_utils.sh unconditionally
appends "__version__ = '1.0.0'" when mpi_local_rank == 0 which mutates the
submodule source repeatedly; change this to be idempotent or avoid editing the
submodule: in the block that checks mpi_local_rank, first test the target file
for an existing __version__ declaration (grep -q '^__version__' ...) and if
present use an in-place replace (sed -i -E "s/^__version__.*/__version__ =
'1.0.0'/") otherwise write the single line; alternatively stop modifying the
submodule entirely by exporting a MODELOPT_VERSION env var or writing the
version into a generated file outside the submodule (instead of echoing into the
submodule file). Ensure the check/replace runs only when mpi_local_rank == 0 to
preserve the single-writer invariant.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f9ecbe27-9ea5-4b76-bf41-3cbb1f0dcf1c

📥 Commits

Reviewing files that changed from the base of the PR and between ad1f0d8 and 8e08365.

📒 Files selected for processing (12)
  • launcher/ADVANCED.md
  • launcher/Qwen/Qwen3-8B/hf_offline_eagle3.yaml
  • launcher/Qwen/Qwen3-8B/megatron_lm_ptq.yaml
  • launcher/common/eagle3/dump_offline_data.sh
  • launcher/common/eagle3/offline_training.sh
  • launcher/common/megatron-lm/quantize/quantize.sh
  • launcher/common/query.py
  • launcher/common/service_utils.sh
  • launcher/common/specdec_bench/quick_check.sh
  • launcher/common/tensorrt-llm/query.sh
  • launcher/common/vllm/query.sh
  • launcher/launch.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • launcher/Qwen/Qwen3-8B/megatron_lm_ptq.yaml

# limitations under the License.

SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
source ${SCRIPT_DIR}/service_utils.sh
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Verify service_utils.sh locations and the current source statement.
fd 'service_utils\.sh$'
sed -n '18,22p' launcher/common/eagle3/offline_training.sh

Repository: NVIDIA/Model-Optimizer

Length of output: 304


🏁 Script executed:

#!/bin/bash
# Check all service_utils.sh files and directory structure
echo "=== All service_utils.sh files ==="
fd 'service_utils\.sh$' -x echo {}

echo ""
echo "=== Directory structure under launcher/common/ ==="
ls -la launcher/common/ | head -20

echo ""
echo "=== Contents of launcher/common/eagle3/ ==="
ls -la launcher/common/eagle3/ 2>/dev/null || echo "Directory not found"

echo ""
echo "=== Verify full path of offline_training.sh ==="
find launcher -name "offline_training.sh" -type f

Repository: NVIDIA/Model-Optimizer

Length of output: 1106


Fix service_utils.sh source path.

Line 19 attempts to source ${SCRIPT_DIR}/service_utils.sh, but since the script is located at launcher/common/eagle3/offline_training.sh, SCRIPT_DIR resolves to launcher/common/eagle3/. The actual service_utils.sh file is at launcher/common/service_utils.sh (one level up), so the source command will fail. Change to:

-source ${SCRIPT_DIR}/service_utils.sh
+source ${SCRIPT_DIR}/../service_utils.sh
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-source ${SCRIPT_DIR}/service_utils.sh
+source ${SCRIPT_DIR}/../service_utils.sh
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/common/eagle3/offline_training.sh` at line 19, The source path is
wrong: in offline_training.sh the variable SCRIPT_DIR points to
launcher/common/eagle3/, but service_utils.sh lives one level up; update the
sourcing in offline_training.sh so it references the correct location (e.g.,
resolve SCRIPT_DIR/../service_utils.sh or compute the parent dir via dirname and
then source service_utils.sh) — modify the source line that currently uses
SCRIPT_DIR to point to the parent directory where service_utils.sh actually
resides.

# --model is consumed here; args before "--" go to vllm serve, args after go to query.py.
MODEL=""
SERVE_EXTRA_ARGS=()
QUERY_ARGS=(--shard-id-begin ${TASK_ID} --shard-id-step ${TASK_COUNT})
⚠️ Potential issue | 🟡 Minor

Quote variables in array initialization.

Unquoted variable expansions in array initialization can cause word splitting issues.

🔧 Suggested fix
-QUERY_ARGS=(--shard-id-begin ${TASK_ID} --shard-id-step ${TASK_COUNT})
+QUERY_ARGS=(--shard-id-begin "${TASK_ID}" --shard-id-step "${TASK_COUNT}")
📝 Committable suggestion


Suggested change
-QUERY_ARGS=(--shard-id-begin ${TASK_ID} --shard-id-step ${TASK_COUNT})
+QUERY_ARGS=(--shard-id-begin "${TASK_ID}" --shard-id-step "${TASK_COUNT}")
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 81-81: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.

(SC2206)


[warning] 81-81: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.

(SC2206)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@launcher/common/vllm/query.sh` at line 81, The array initialization
QUERY_ARGS=(--shard-id-begin ${TASK_ID} --shard-id-step ${TASK_COUNT}) should
quote the variable expansions to avoid word splitting; update the QUERY_ARGS
assignment to use "--shard-id-begin" "${TASK_ID}" and "--shard-id-step"
"${TASK_COUNT}" (referencing QUERY_ARGS, TASK_ID, TASK_COUNT) so each option and
its value remain single array elements.

ChenhanYu and others added 7 commits March 14, 2026 19:44
Add tests/unit/launcher/ with 7 test files covering core dataclasses,
factory registry, global_vars, env merging, YAML formats, Docker
executor mounts, Slurm executor params (mocked), and end-to-end Docker
launch via subprocess.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Remove self-referential launcher/modules/Model-Optimizer submodule
(flagged in PR review as creating recursive nesting). Replace with a
symlink to ../.. (the Model-Optimizer root). The packager's find
follows symlinks so modelopt/* and examples/* are packaged correctly.

Verified: Docker launch with symlink works (quantize step finds modelopt).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Add launcher/.gitignore, CLAUDE.md. Update README with hf_local docs,
test instructions, verified results. Fix ADVANCED.md stale paths. Add
hf_local to GlobalVariables. Use <<global_vars.hf_local>> in YAML.
Remove stale services/* from packager. quantize.sh reads MMLU_DATASET
env var. launch.py auto-creates Model-Optimizer symlink.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Skip all launcher tests with pytest.skip when nemo_run is not available
(CI tox env doesn't have it). Add docstrings to __post_init__ and
_resolve for 100% docstring coverage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Move tests from tests/unit/launcher/ to launcher/tests/ for
self-containment. Add launcher job to unit_tests.yml using uv.
Add pytest.ini to override root pyproject.toml addopts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Switch from git-pinned nemo-run to nemo-run>=0.8.0 from PyPI (avoids
uv TOML parse error). Add py-modules=[] to prevent setuptools auto-
discovery. CI installs project with `uv pip install -e . pytest`.
Add ModelOpt mount mechanism docs to ADVANCED.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>